A Method for Calculating Term Similarity on Large Document Collections

Authors

  • Wolfgang W. Bein
  • Jeffrey S. Coombs
  • Kazem Taghva
Abstract

We present an efficient algorithm, the Quadtree Heuristic, for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the Expected Mutual Information Measure (EMIM). Because the similarity lists are intended to improve information retrieval (IR), we report an experiment comparing the performance of an IR engine designed to use them. Two methods were used to generate the similarity lists: a brute-force technique and the Quadtree Heuristic. The list generated by the Quadtree Heuristic performed comparably to the brute-force list.
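To make the brute-force baseline concrete, the following is a minimal Python sketch of EMIM-based similarity lists. It is an illustration under stated assumptions, not the authors' implementation: it assumes the common four-cell form of EMIM over binary term occurrence per document, the function names `emim` and `similarity_lists` and the parameter `k` are hypothetical, and it does not attempt the Quadtree Heuristic, which the abstract does not describe.

```python
import math
from collections import defaultdict
from itertools import combinations


def emim(n, n_i, n_j, n_ij):
    """EMIM of two terms from document counts (assumed four-cell form).

    n    : total number of documents
    n_i  : documents containing term i
    n_j  : documents containing term j
    n_ij : documents containing both terms
    """
    # Each cell: (joint count, marginal count for term i, marginal count for term j)
    cells = [
        (n_ij, n_i, n_j),                          # both present
        (n_i - n_ij, n_i, n - n_j),                # i present, j absent
        (n_j - n_ij, n - n_i, n_j),                # i absent, j present
        (n - n_i - n_j + n_ij, n - n_i, n - n_j),  # both absent
    ]
    total = 0.0
    for joint, m_i, m_j in cells:
        if joint > 0:  # an empty cell contributes nothing (0 * log 0 := 0)
            total += (joint / n) * math.log(joint * n / (m_i * m_j))
    return total


def similarity_lists(docs, k=3):
    """Brute-force top-k similar terms per term; scores every term pair."""
    n = len(docs)
    df = defaultdict(int)  # document frequency per term
    co = defaultdict(int)  # co-occurrence count per sorted term pair
    for doc in docs:
        terms = sorted(set(doc))
        for t in terms:
            df[t] += 1
        for a, b in combinations(terms, 2):
            co[(a, b)] += 1

    scores = defaultdict(list)
    for a, b in combinations(sorted(df), 2):
        s = emim(n, df[a], df[b], co.get((a, b), 0))
        scores[a].append((s, b))
        scores[b].append((s, a))

    # Highest score first; ties broken alphabetically for determinism.
    return {
        t: [u for _, u in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for t, pairs in scores.items()
    }
```

The pairwise loop is quadratic in the vocabulary size, which is the cost that motivates a faster heuristic on large collections. Note also that mutual information in this form rewards strong negative association as well as positive, so a practical system may need to filter pairs that rarely or never co-occur.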

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the extracted keywords' similarity with the gold standard and users' views of the model keywords. Methodology: This is a mixed text-mining study in which LDA topic modeling is used to extract keywords from the table of contents of sci...


Clustering Documents with Large Overlap of Terms into Different Clusters based on Similarity Rough Set Model

The similarity rough set model for document clustering (SRSM) uses a generalized rough set model based on a similarity relation and term co-occurrence to group the documents in a collection into clusters. The model extends the tolerance rough set model (TRSM) (Ho and Funakoshi, 1997). The SRSM methods have been evaluated, and the results showed that they perform better than TRSM. However, in document...


Event-centric Document Similarity for Biomedical Literature

Identifying similar documents for a given query document helps users to explore large document collections. However, most existing techniques are based on the vector space model and handle documents only as bags of words. Thus, more complex information that can be used for calculating similarities is not taken into account. For example, events play an important role in the biomedical literature...


Automatic Construction of Evaluation Sets and Evaluation of Document Similarity Models in Large Scholarly Retrieval Systems

Retrieval systems for scholarly literature offer the ability for the scientific community to search, explore and download scholarly articles across various scientific disciplines. Mostly used by the experts in the particular field, these systems contain user community logs including information on user specific downloaded articles. In this paper we present a novel approach for automatically eva...


Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...



Publication date: 2003